Collect metrics about the active/idle connections to ES nodes #141434

gsoldevila · 2022-09-22T13:45:03Z

As part of #134362, we want to have more visibility on the amount of open connections to ES nodes, for our different deployments.

With a cloud-first mindset, we are adding a new collector that will retrieve socket information from AgentManager, exposing this information on the /api/stats endpoint, so that it can be consumed by the Kibana Metric Beat, and sent over to our monitoring cluster.

Here's an example of what the new properties look like (extracted from a local environment):

elasticmachine · 2022-09-22T13:45:06Z

Pinging @elastic/kibana-core (Team:Core)

TinaHeiligers

A few comments and a whole bunch of questions but I'll review again once CI goes green.

There will be some follow-up work to add these metrics to Cloud

TinaHeiligers · 2022-09-22T22:48:33Z

packages/core/elasticsearch/core-elasticsearch-client-server-internal/src/agent_manager.test.ts

+
+      expect(httpAgents.size).toEqual(2);
+      expect(httpsAgents.size).toEqual(2);
+      expect(httpAgents.has(agent1)).toEqual(true);


Alternative:

expect([...httpAgents]).toEqual(expect.arrayContaining([agent1, agent2]));
expect([...httpAgents]).not.toEqual(expect.arrayContaining([agent3, agent4]));
expect([...httpsAgents]).toEqual(expect.arrayContaining([agent3, agent4]));
expect([...httpsAgents]).not.toEqual(expect.arrayContaining([agent1, agent2]));

Don't think that would work

expect.arrayContaining(array) matches a received array which contains **all** of the elements in the expected array

expect([...httpsAgents]).not.toEqual(expect.arrayContaining([agent1, agent2])); would be passing for [agent1, agent3] AFAIK

Alright! I'll simplify what can be simplified then.
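The objection about the negated `expect.arrayContaining` can be illustrated without Jest. The following is a minimal sketch: `containsAll` is a hypothetical stand-in that mirrors the "contains all of the expected elements" semantics quoted above, and the agent objects are placeholders rather than real agents.

```typescript
// Hypothetical stand-in for the semantics of expect.arrayContaining:
// true only when EVERY expected element is present in the received array.
function containsAll<T>(received: T[], expected: T[]): boolean {
  return expected.every((item) => received.includes(item));
}

// Placeholder objects standing in for the agents created in the test.
const agent1 = { label: 'agent1' };
const agent2 = { label: 'agent2' };
const agent3 = { label: 'agent3' };

// A collection that wrongly contains agent1 alongside agent3.
const httpsAgents = [agent1, agent3];

// Equivalent of:
//   expect(httpsAgents).not.toEqual(expect.arrayContaining([agent1, agent2]))
// The negation holds because agent2 is missing, even though agent1 leaked in.
const negatedAssertionPasses = !containsAll(httpsAgents, [agent1, agent2]);

// Checking each element individually does catch the leak.
const leakDetected = httpsAgents.includes(agent1);
```

This is why the per-element `has(...)` checks are stricter than the negated `arrayContaining` form.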

TinaHeiligers · 2022-09-22T23:03:52Z

packages/core/metrics/core-metrics-collectors-server-internal/src/elasticsearch_client.test.ts

+    const httpAgents = new Set<HttpAgent>([new HttpAgent(), new HttpAgent()]);
+    const httpsAgents = new Set<HttpsAgent>([new HttpsAgent(), new HttpsAgent()]);
+
+    AgentManagerMock.mockImplementationOnce(() => ({


In the longer term, we might need a proper mock for AgentManager but for now, it's probably ok to inline here.

TinaHeiligers · 2022-09-22T23:31:29Z

packages/core/metrics/core-metrics-collectors-server-internal/src/elasticsearch_client.ts

+  }
+
+  public reset() {
+    // TODO check if we have to implement


We'd probably want to reset the collector on rolling restarts, to ensure we're not carrying over any stale metrics. Even if not needed, it's probably a good idea to reset the collector anyway.

Client metrics are kinda like os/cgroups metrics though, in that the actual collector doesn't hold any data and just delegates to a lower-level actor, so I'm not sure what we could or should be cleaning here?

That answers the question, thanks!
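To make the point about delegation concrete, here is a minimal sketch of a collector that owns no state. All names (`MetricsCollector`, `SocketSource`, `DelegatingCollector`) are hypothetical and only illustrate the shape of the argument; they are not the Kibana types.

```typescript
// Hypothetical collector contract: collect() returns metrics, reset() clears state.
interface MetricsCollector<T> {
  collect(): Promise<T>;
  reset(): void;
}

interface SocketCounts {
  active: number;
  idle: number;
}

// Stand-in for the lower-level actor that actually owns the data.
class SocketSource {
  public active = 0;
  public idle = 0;

  public snapshot(): SocketCounts {
    return { active: this.active, idle: this.idle };
  }
}

// The collector keeps no counters of its own: every collect() reads a fresh
// snapshot from the source, so there is nothing stale to clear in reset().
class DelegatingCollector implements MetricsCollector<SocketCounts> {
  constructor(private readonly socketSource: SocketSource) {}

  public async collect(): Promise<SocketCounts> {
    return this.socketSource.snapshot();
  }

  public reset(): void {
    // Intentionally empty: the collector holds no data between calls.
  }
}

const socketSource = new SocketSource();
socketSource.active = 3;
const collector = new DelegatingCollector(socketSource);
collector.reset(); // does not touch the underlying source
```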

TinaHeiligers · 2022-09-22T23:53:40Z

packages/core/metrics/core-metrics-collectors-server-internal/src/elasticsearch_client.test.ts

+    expect(getAgentsSocketsStats).toHaveBeenCalledTimes(2);
+    expect(getAgentsSocketsStats).toHaveBeenNthCalledWith(1, httpAgents);
+    expect(getAgentsSocketsStats).toHaveBeenNthCalledWith(2, httpsAgents);
+    expect(metrics).toMatchInlineSnapshot(`


Using snapshots in the metrics service has, in my experience, been a nightmare with flaky tests. We're already seeing failures for https://github.com/elastic/kibana/pull/141434/files#diff-3876c2ad50be4ead02241bceff9d5a9ee0d493a5fb53ff27c628cecc5d2716a6R72.

TinaHeiligers · 2022-09-22T23:54:14Z

...es/core/metrics/core-metrics-collectors-server-internal/src/get_agents_sockets_stats.test.ts

+    const agents = new Set<Agent>([new Agent(), new Agent()]);
+
+    const stats = getAgentsSocketsStats(agents);
+    expect(stats).toMatchInlineSnapshot(`


Using snapshots has, in my experience, been flaky with the metrics service. We should be proactive and create test fixtures for these, similarly to how we test the process and event loop delays.

pgayvallet

Overall implementation looking good.

Just a bunch of remarks and NITs

pgayvallet · 2022-09-23T07:19:02Z

packages/core/elasticsearch/core-elasticsearch-client-server-internal/src/agent_manager.test.ts

+      expect(httpAgents.size).toEqual(2);
+      expect(httpsAgents.size).toEqual(2);
+      expect(httpAgents.has(agent1)).toEqual(true);
+      expect(httpAgents.has(agent2)).toEqual(true);
+      expect(httpAgents.has(agent3)).toEqual(false);
+      expect(httpAgents.has(agent4)).toEqual(false);
+      expect(httpsAgents.has(agent1)).toEqual(false);
+      expect(httpsAgents.has(agent2)).toEqual(false);
+      expect(httpsAgents.has(agent3)).toEqual(true);
+      expect(httpsAgents.has(agent4)).toEqual(true);


NIT: I think the order in these arrays is determined by the order of creation, right? So at this point, why not just check for exact array content?

expect(httpAgents).toEqual([agent1, agent2]);
expect(httpsAgents).toEqual([agent3, agent4]);
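The ordering assumption holds for a JavaScript `Set`, which iterates in insertion order. A minimal sketch, assuming the agents are held in a `Set` (as the `getHttpAgents(): Set<NetworkAgent>` signature elsewhere in this review suggests) and therefore need converting to an array before an exact comparison; the agent objects are placeholders.

```typescript
// Placeholder objects standing in for agents created in a known order.
const agent1 = { id: 1 };
const agent2 = { id: 2 };

// A Set preserves insertion order when iterated.
const httpAgents = new Set([agent1, agent2]);

// Convert to an array so it can be compared element by element,
// e.g. with expect(Array.from(httpAgents)).toEqual([agent1, agent2]) in Jest.
const httpAgentsInOrder = Array.from(httpAgents);
```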

pgayvallet · 2022-09-23T07:20:39Z

packages/core/elasticsearch/core-elasticsearch-client-server-internal/src/agent_manager.ts

+  public getHttpAgents(): Set<NetworkAgent> {
+    return this.httpStore;
+  }
+
+  public getHttpsAgents(): Set<NetworkAgent> {
+    return this.httpsStore;
+  }


I think we can return Set<HttpAgent> for getHttpAgents and Set<HttpsAgent> for getHttpsAgents, can't we? Would be a little more precise.

N/A as I've merged both Agent types into a single set (commit coming soon).
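A rough sketch of what a single merged store could look like. `AgentStore`, `NetworkAgent`, and `getAgents()` are names that appear elsewhere in this review; the `SimpleAgentStore` class and its `register` method are hypothetical and only illustrate the idea, not the actual AgentManager code.

```typescript
import { Agent as HttpAgent } from 'http';
import { Agent as HttpsAgent } from 'https';

// One union type covers both protocols.
type NetworkAgent = HttpAgent | HttpsAgent;

// Read-only view that a metrics collector could depend on.
interface AgentStore {
  getAgents(): Set<NetworkAgent>;
}

// Hypothetical store keeping every agent in a single set, regardless of protocol.
class SimpleAgentStore implements AgentStore {
  private readonly agents = new Set<NetworkAgent>();

  public register<T extends NetworkAgent>(agent: T): T {
    this.agents.add(agent);
    return agent;
  }

  public getAgents(): Set<NetworkAgent> {
    return this.agents;
  }
}

const agentStore = new SimpleAgentStore();
const plainAgent = agentStore.register(new HttpAgent());
const secureAgent = agentStore.register(new HttpsAgent());
```

With a single set, the question of more precise per-protocol return types no longer applies.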

pgayvallet · 2022-09-23T07:22:41Z

packages/core/elasticsearch/core-elasticsearch-server-internal/src/types.ts

@@ -12,6 +12,7 @@ import type {
  ElasticsearchServiceStart,
  ElasticsearchServiceSetup,
 } from '@kbn/core-elasticsearch-server';
+import { AgentManager } from '@kbn/core-elasticsearch-client-server-internal';


NIT: import type

pgayvallet · 2022-09-23T07:25:16Z

packages/core/elasticsearch/core-elasticsearch-server-mocks/src/elasticsearch_service.mock.ts

@@ -94,6 +95,7 @@ const createInternalSetupContractMock = () => {
      level: ServiceStatusLevels.available,
      summary: 'Elasticsearch is available',
    }),
+    agentManager: new AgentManager(),


+1 to @TinaHeiligers's remark: we'd ideally want a proper mocked version here instead of delegating to the actual implementation.

Note that given it is only exposed on the internal contract / mock, the impact is not that significant.

pgayvallet · 2022-09-23T07:29:58Z

packages/core/metrics/core-metrics-collectors-server-internal/src/elasticsearch_client.test.ts

+jest.mock('@kbn/core-elasticsearch-client-server-internal');
+jest.mock('./get_agents_sockets_stats');
+
+const AgentManagerMock = AgentManager as jest.MockedClass<typeof AgentManager>;
+const getAgentsSocketsStatsMock = getAgentsSocketsStats as jest.MockedFunction<
+  typeof getAgentsSocketsStats
+>;


NIT: For consistency, it would be better to use the pattern of declaring a test's mocks in a [file].test.mocks.ts file and importing the mocked things from it, e.g. packages/core/metrics/core-metrics-collectors-server-internal/src/os.test.mocks.ts

pgayvallet · 2022-09-23T07:42:11Z

packages/core/metrics/core-metrics-collectors-server-internal/src/elasticsearch_client.ts

+  }
+
+  public reset() {
+    // TODO check if we have to implement


Client metrics are kinda like os/cgroups metrics though, in that the actual collector doesn't hold any data and just delegates to a lower-level actor, so I'm not sure what we could or should be cleaning here?

pgayvallet · 2022-09-23T07:51:03Z

packages/core/metrics/core-metrics-collectors-server-internal/src/get_agents_sockets_stats.ts

+      totalActiveSockets += sockets?.length ?? 0;
+      nodesWithActiveSockets[node] = (nodesWithActiveSockets[node] ?? 0) + (sockets?.length ?? 0);


NIT: I would set sockets?.length ?? 0; into a variable to avoid repeating it (same with freeSockets?.length ?? 0 below).
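The suggestion could look roughly like this. It is a simplified sketch: `countActiveSockets` and the `SocketsByNode` shape are hypothetical, and only `totalActiveSockets` and `nodesWithActiveSockets` come from the diff above.

```typescript
// Hypothetical simplified shape: node identifier -> sockets open to that node.
type SocketsByNode = Record<string, unknown[] | undefined>;

function countActiveSockets(agentsSockets: SocketsByNode[]) {
  let totalActiveSockets = 0;
  const nodesWithActiveSockets: Record<string, number> = {};

  for (const socketsByNode of agentsSockets) {
    for (const [node, sockets] of Object.entries(socketsByNode)) {
      // Compute the count once and reuse it in both places.
      const activeSockets = sockets?.length ?? 0;
      totalActiveSockets += activeSockets;
      nodesWithActiveSockets[node] = (nodesWithActiveSockets[node] ?? 0) + activeSockets;
    }
  }

  return { totalActiveSockets, nodesWithActiveSockets };
}

// Two agents, one of which has an undefined socket list for a node.
const activeCounts = countActiveSockets([
  { 'es-node-1:9200': [{}, {}], 'es-node-2:9200': undefined },
  { 'es-node-1:9200': [{}] },
]);
```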

pgayvallet · 2022-09-23T07:55:19Z

packages/core/metrics/core-metrics-server-internal/src/metrics_service.ts

+    this.metricsCollector = new OpsMetricsCollector(
+      http.server,
+      elasticsearchService.agentManager,
+      {
+        logger: this.logger,
+        ...config.cGroupOverrides,
+      }
+    );


I was gonna ask why you didn't add agentManager to the second param, but looking at it, I see we're using the OpsMetricsCollectorOptions both for OpsMetricsCollector and OsMetricsCollector construction.

Outside of the scope of this PR, but we should fix that, and have OpsMetricsCollector use its own single structure / type.

Created #141922 for it

pgayvallet · 2022-09-23T07:57:54Z

packages/core/metrics/core-metrics-server-internal/src/metrics_service.test.ts

  describe('#start', () => {
    it('invokes setInterval with the configured interval', async () => {
-      await metricsService.setup({ http: httpMock });
+      await metricsService.setup({ http: httpMock, elasticsearchService: esServiceMock });
      await metricsService.start();



We're actually missing a test to assert what parameters the OpsMetricsCollector instance is created with during #setup.

Optional given it wasn't here in the first place, but we could add one.

pgayvallet · 2022-09-23T07:59:52Z

packages/core/metrics/core-metrics-server/src/metrics.ts

+  /** Number of HTTP Agents that have are currently being used by the ES-js client */
+  agents: number;


NIT: can be http or https. would just remove this word from the description

rudolf · 2022-09-23T08:44:04Z

packages/core/metrics/core-metrics-server/src/metrics.ts

+export interface ElasticsearchClientsMetrics {
+  /** Number of HTTP Agents that have are currently being used by the ES-js client */
+  agents: number;
+  /** Number of ES instances (or proxies) that ES-js client is connecting to */


NIT:

Suggested change

/** Number of ES instances (or proxies) that ES-js client is connecting to */

/** Number of ES nodes that ES-js client is connecting to */

Kibana doesn't support connecting through a proxy.

rudolf · 2022-09-23T08:57:49Z

packages/core/metrics/core-metrics-server/src/metrics.ts

+export interface ElasticsearchClientsMetricsByProtocol {
+  http: ElasticsearchClientsMetrics;
+  https: ElasticsearchClientsMetrics;
+}


What's the value in having separate metrics for http and https?
Is it even possible to configure Kibana to use https and http nodes? Is it even possible to run an Elasticsearch cluster with mixed http / https nodes?

As a first step we just want to use this data to understand and optimize the default performance on cloud. But if this is useful to us it's useful to customers too. But if there's two keys then all the dashboards built on this data will have to have two queries and it will always show e.g. "http.averageActiveSocketsPerNode" and "https.averageActiveSocketsPerNode" even if only one will have data. So this kinda makes the dashboard noisy.

I think it's technically possible, at least from a configuration standpoint. When users specify the hosts: property they could be introducing mixed http and https hosts.

I was thinking that we would mainly rely on "https" for dashboards, which will be the norm for our cloud deployments.

Then, if we find a faulty deployment with degraded performance, we could take a look and see if there are open sockets in the 'http' protocol? I'm not really sure how helpful it can be.

If you think it brings more noise than value I can simply remove the 'http' one, or try to merge them together into a single "store". Alternatively, from the Kibana metrics beat, we could simply consume the 'https' one. This way, the /api/stats will continue to expose both (which could be useful for investigations on self-managed), and the monitoring cluster will only ingest the 'https' one, WDYT?

I can imagine it might be possible to configure mixed http/https nodes for Elasticsearch: for high availability, you might have mixed nodes while you switch the nodes in your cluster, one at a time, to serving over https. But I suspect this is extremely rare as a long-term configuration.

Mixed protocols mean the elasticsearch.maxSockets configuration option behaves differently, and Kibana would have twice as many sockets as in a pure http or pure https configuration. In such a case, only seeing the https sockets in a dashboard would be misleading, and I think it would be more useful to see all sockets to all protocols on a dashboard.

So it feels like merging the agents would always provide a more holistic picture of how many sockets your Kibana is opening. I'm not sure if knowing if it's https or http would make us make any different decisions, in some sense it feels like an implementation detail, but we could have a field like protocol: 'http' | 'https' | 'mixed'.

That's a fair point, I'll update my PR accordingly, thanks!

rudolf · 2022-09-29T13:36:08Z

packages/core/metrics/core-metrics-collectors-server-internal/src/get_agents_sockets_stats.ts

+  else if (http) protocol = 'http';
+  else protocol = 'none';
+
+  return {


I'm not sure where it happens but I tested this by creating a lot of idle connections and then getting /api/stats which returns:

...
"elasticsearch_client": {
  "protocol": "https",
  "connected_nodes": 1,
  "nodes_with_active_sockets": 0,
  "nodes_with_idle_sockets": 1,
  "total_active_sockets": 0,
  "total_idle_sockets": 4494,
  "total_queued_requests": 0,
  "most_active_node_sockets": null,
  "average_active_sockets_per_node": null,
  "most_idle_node_sockets": 4494,
  "average_idle_sockets_per_node": 4494
},
...

We would probably use a long mapping for these fields, so I think a value of null (like for average_active_sockets_per_node) would cause an indexing error.

Good catch! I'm adding some UT to cover these edge cases.
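One way to avoid the null values is to guard the empty case explicitly. This is a minimal sketch under the assumption that 0 is an acceptable fallback when no node has sockets; `summarizeSocketsPerNode` is a hypothetical helper, not the function used in the PR.

```typescript
// Hypothetical helper: given the socket count for each node that has sockets,
// return the maximum and the average, falling back to 0 for an empty list
// (ASSUMPTION: 0 is the desired value when there is nothing to measure).
function summarizeSocketsPerNode(socketsPerNode: number[]): { most: number; average: number } {
  if (socketsPerNode.length === 0) {
    // Without this guard, the maximum of an empty list and a division by
    // zero would not produce usable numeric values.
    return { most: 0, average: 0 };
  }
  const total = socketsPerNode.reduce((sum, count) => sum + count, 0);
  return {
    most: Math.max(...socketsPerNode),
    average: total / socketsPerNode.length,
  };
}

// The scenario from the comment above: no active sockets, one node with 4494 idle sockets.
const activeSummary = summarizeSocketsPerNode([]);
const idleSummary = summarizeSocketsPerNode([4494]);
```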

pgayvallet

Few remarks, but looking good to me

pgayvallet · 2022-10-03T13:27:01Z

packages/core/elasticsearch/core-elasticsearch-client-server-internal/src/cluster_client.ts

@@ -26,7 +26,7 @@ import { ScopedClusterClient } from './scoped_cluster_client';
 import { getDefaultHeaders } from './headers';
 import { createInternalErrorHandler, InternalUnauthorizedErrorHandler } from './retry_unauthorized';
 import { createTransport } from './create_transport';
-import { AgentManager } from './agent_manager';
+import { AgentFactoryProvider } from './agent_manager';


NIT: import type

pgayvallet · 2022-10-03T13:27:41Z

packages/core/elasticsearch/core-elasticsearch-client-server-internal/src/configure_client.ts

@@ -12,7 +12,7 @@ import type { ElasticsearchClientConfig } from '@kbn/core-elasticsearch-server';
 import { parseClientOptions } from './client_config';
 import { instrumentEsQueryAndDeprecationLogger } from './log_query_and_deprecation';
 import { createTransport } from './create_transport';
-import { AgentManager } from './agent_manager';
+import { AgentFactoryProvider } from './agent_manager';


nit: import type

pgayvallet · 2022-10-03T13:28:16Z

packages/core/elasticsearch/core-elasticsearch-client-server-mocks/src/agent_manager.mocks.ts

+ * Side Public License, v 1.
+ */
+
+import { AgentStore, NetworkAgent } from '@kbn/core-elasticsearch-client-server-internal';


NIT: import type

pgayvallet · 2022-10-03T13:36:22Z

packages/core/metrics/core-metrics-collectors-server-internal/src/elasticsearch_client.ts

+  constructor(private readonly agentStore: AgentStore) {}
+
+  public async collect(): Promise<ElasticsearchClientsMetrics> {
+    return getAgentsSocketsStats(this.agentStore.getAgents());


NIT: return await getAgentsSocketsStats for better stacktraces
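For illustration, a minimal sketch of the `return await` form. The names are hypothetical stand-ins for the collector and the stats function; the stack-trace benefit is assumed to come from the engine's async stack traces, which generally follow `await` points.

```typescript
// Hypothetical stand-in for the asynchronous stats function.
async function getStats(): Promise<number> {
  return 42;
}

class StatsCollector {
  // With `return await`, collect() stays suspended until getStats() settles,
  // so if getStats() rejects, the async stack trace can include collect().
  // Returning the promise directly would let collect() finish first.
  public async collect(): Promise<number> {
    return await getStats();
  }
}

const pendingStats = new StatsCollector().collect();
```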

pgayvallet · 2022-10-03T13:44:58Z

...e/metrics/core-metrics-collectors-server-internal/src/get_agents_sockets_stats.test.mocks.ts

+  jest.doMock('http');
+  const agent = new HttpAgent();
+  return Object.assign(agent, defaults);


What are we doing here exactly? what's the intent of the in-function doMock call?

My goal was to mock HttpAgent and allow overriding a few of its properties (mainly the lists of sockets and freeSockets). There's probably a better way to do it 😬

UPDATE: I need a "real" instance so that the code in getAgentsSocketsStats that checks instanceof HttpsAgent does not fail for the test.

UPDATE 2: I removed the doMock(..) statements. They're unnecessary cause I don't need to mock any methods of the class afterwards.
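The approach described in the updates can be sketched as follows: build a real agent instance so `instanceof` checks keep working, then override only the socket lists. `createHttpsAgentFixture` and `AgentOverrides` are hypothetical names, and the node key is a made-up example.

```typescript
import { Agent as HttpsAgent } from 'https';
import { Socket } from 'net';

// Hypothetical subset of agent properties a test may want to override.
interface AgentOverrides {
  sockets?: Record<string, Socket[]>;
  freeSockets?: Record<string, Socket[]>;
}

// Creates a REAL HttpsAgent (so `instanceof HttpsAgent` is true) and copies
// the overrides onto it. No module mocking is involved.
function createHttpsAgentFixture(overrides: AgentOverrides = {}): HttpsAgent {
  return Object.assign(new HttpsAgent(), overrides);
}

// Sockets are created but never connected; they only serve as list entries.
const agentFixture = createHttpsAgentFixture({
  sockets: { 'es-node-1:9200': [new Socket(), new Socket()] },
});
```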

pgayvallet · 2022-10-04T06:29:52Z

packages/core/metrics/core-metrics-collectors-server-internal/src/get_agents_sockets_stats.ts

+  let protocol: ElasticsearchClientProtocol;
+
+  if (http && https) protocol = 'mixed';
+  else if (https) protocol = 'https';
+  else if (http) protocol = 'http';
+  else protocol = 'none';


NIT: The dark side of the inlining is not strong in this one.

const protocol: ElasticsearchClientProtocol = http ? (https ? 'mixed' : 'http') : https ? 'https' : 'none';
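An if/else chain and a nested-ternary form can be checked against each other over all four input combinations. A minimal sketch: the protocol values come from this thread, while the two function names are hypothetical.

```typescript
type ElasticsearchClientProtocol = 'none' | 'http' | 'https' | 'mixed';

// The if/else chain from the diff, wrapped in a function.
function protocolFromIfChain(http: boolean, https: boolean): ElasticsearchClientProtocol {
  let protocol: ElasticsearchClientProtocol;
  if (http && https) protocol = 'mixed';
  else if (https) protocol = 'https';
  else if (http) protocol = 'http';
  else protocol = 'none';
  return protocol;
}

// A nested-ternary form with explicit parentheses around the inner branch.
function protocolFromTernary(http: boolean, https: boolean): ElasticsearchClientProtocol {
  return http ? (https ? 'mixed' : 'http') : https ? 'https' : 'none';
}

// Exhaustively compare both forms over every combination of flags.
const flagValues = [false, true];
const formsAgree = flagValues.every((http) =>
  flagValues.every((https) => protocolFromIfChain(http, https) === protocolFromTernary(http, https))
);
```

Whether the one-liner is easier to read than the chain is a matter of taste; the check only shows the two are equivalent.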

kibana-ci · 2022-10-04T11:25:25Z

💛 Build succeeded, but was flaky

Buildkite Build
Commit: 6718285

Failed CI Steps

Test Failures

[job] [logs] FTR Configs #36 / analytics instrumented events from the browser Loaded Dashboard full loaded dashboard should emit the "Loaded Dashboard" event when done loading complex dashboard
[job] [logs] FTR Configs #36 / analytics instrumented events from the browser Loaded Dashboard full loaded dashboard should emit the "Loaded Dashboard" event when done loading complex dashboard

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

id	before	after	diff
`@kbn/core-elasticsearch-client-server-internal`	11	13	+2
`@kbn/core-elasticsearch-client-server-mocks`	32	34	+2
`@kbn/core-metrics-collectors-server-internal`	20	25	+5
`@kbn/core-metrics-server-internal`	5	6	+1
`@kbn/core-metrics-server-mocks`	7	19	+12
total			+22

Public APIs missing exports

Total count of every type that is part of your API that should be exported but is not. This will cause broken links in the API documentation system. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats exports for more detailed information.

id	before	after	diff
`@kbn/core-elasticsearch-client-server-internal`	1	2	+1

Unknown metric groups

API count

id	before	after	diff
`@kbn/core-elasticsearch-client-server-internal`	13	15	+2
`@kbn/core-elasticsearch-client-server-mocks`	36	38	+2
`@kbn/core-metrics-collectors-server-internal`	24	29	+5
`@kbn/core-metrics-server`	48	62	+14
`@kbn/core-metrics-server-internal`	5	6	+1
`@kbn/core-metrics-server-mocks`	7	19	+12
`core`	2686	2687	+1
total			+37

History

💔 Build #77838 failed 63a4314
💔 Build #77830 failed 787b250
💚 Build #77395 succeeded d5e9bce
💛 Build #77149 was flaky d2ca429

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

…c#141434)
* Collect metrics about the connections from esClient to ES nodes
* Misc enhancements following PR remarks and comments
* Fix UTs
* Fix mock typings
* Minimize API surface, fix mocks typings
* Fix incomplete mocks
* Fix renameed agentManager => agentStore in remaining UT
* Cover edge cases for getAgentsSocketsStats()
* Misc NIT enhancements
* Revert incorrect import type statements

[PR #141434](#141434) exposes a bunch of metrics related to the Elasticsearch Client in the `/api/stats` endpoint. While all these stats are interesting, some of them might be less relevant than others right now. Let's start by exposing only those stats that are more critical from a monitoring standpoint.

Collect metrics about the connections from esClient to ES nodes

85a08e9

gsoldevila requested a review from a team as a code owner September 22, 2022 13:45

gsoldevila added backport:skip This commit does not require backporting and removed backport:prev-minor Backport to (8.x) the previous minor version (i.e. one version back from main) labels Sep 22, 2022

TinaHeiligers reviewed Sep 23, 2022

pgayvallet reviewed Sep 23, 2022

rudolf requested changes Sep 23, 2022

gsoldevila mentioned this pull request Sep 27, 2022

OpsMetricsCollector and OsMetricsCollector share the same constructor options interface #141922

Closed

gsoldevila added 8 commits September 27, 2022 18:09

Misc enhancements following PR remarks and comments

f0ef09a

Merge branch 'main' into kbn-134362-monitor-esclient-sockets

bd9d77e

Fix UTs

bdeaed9

Fix mock typings

1b8dafc

Minimize API surface, fix mocks typings

2ea1ab2

Merge branch 'main' into kbn-134362-monitor-esclient-sockets

ad1effa

Fix incomplete mocks

0bd4e59

Fix renameed agentManager => agentStore in remaining UT

c17fb68

rudolf reviewed Sep 29, 2022

gsoldevila added 2 commits September 30, 2022 14:54

Cover edge cases for getAgentsSocketsStats()

d2ca429

Merge branch 'main' into kbn-134362-monitor-esclient-sockets

d5e9bce

gsoldevila requested review from rudolf, TinaHeiligers and pgayvallet October 3, 2022 08:27

pgayvallet approved these changes Oct 4, 2022

gsoldevila added 4 commits October 4, 2022 11:18

Misc NIT enhancements

77f057e

Merge branch 'main' into kbn-134362-monitor-esclient-sockets

787b250

Revert incorrect import type statements

63a4314

Merge branch 'main' into kbn-134362-monitor-esclient-sockets

6718285

rudolf approved these changes Oct 4, 2022

gsoldevila merged commit 25b79a9 into elastic:main Oct 4, 2022

gsoldevila mentioned this pull request Oct 4, 2022

Add maxIdleSockets and idleSocketTimeout to Elasticsearch config #142019

Merged

9 tasks

This was referenced Nov 9, 2022

Add esclient stats elastic/beats#33621

Merged

Remove non-essential ES-client stats from /api/stats #145120

Merged
